Introduction

Legislation has been passed and social norms challenged to help level the playing field for different races and genders. Yet, according to Williams (1987), “risk-averse employers believe and act as if black workers are on average less productive than their white counterparts; employers thus hire blacks at a wage discount or not at all.” Williams (1987) goes on to describe a second case that “presumes blacks and whites are equally productive on average, but blacks display a greater variance in ability; hence risk-averse employers’ hiring decisions could precipitate a racial wage gap.” Under either account, business owners are more likely to judge productivity and skill by race than to incur the cost of acquiring and interpreting statistically meaningful data on individual workers. Historical wage data for these groups bear this out: wages have increased over time, but not at the same rate. The wage gap remains.

Literature Review

\(y=f(x)\) is a common expression of the idea that a given output is a function of all its inputs. This is a deceptively simple concept, but an important one, and it has made research into the wage gap difficult. Numerous published articles try to pinpoint why a wage gap exists across multiple diversity categories. There are so many, in fact, that some combinations of factors yield no gap at all. Take, for example, the May 2006 work published by Dan Black:

We find that these wage differences generally appear to be the consequence of differences in premarket factors: age, the levels and types of education, and English fluency and/or assimilation. In particular, among college-educated men who speak English at home, our estimated wage gaps are very close to 0 for Hispanic and Asian men. Similarly, the unexplained wage gap is approximately 0 for black men with college-educated parents not born in the South. We provide fragmentary evidence that the unexplained gap for other black men - Southern-born men and those born elsewhere to poorly educated parents - is related to the generally poor quality of education afforded these men at the precollege and college levels.

This is in direct contrast to Matt Huffman’s 2004 work, which found evidence of increasing tendencies toward racial discrimination as the job stakes are raised (high-status jobs).

The majority of published works we reviewed found evidence to support the claim that a wage gap exists between underrepresented populations and the dominant population. Studies ranged from scopes as broad as the 2006 work of Oliver and Shapiro, which looked at total debt-to-asset ratios, to scopes as narrow as the 2010 work of Broyles and Fenner, which looked specifically at the field of STEM. Although a lot of information has been published, few works reproduce or confirm those results, so we set out to find the answer ourselves. Despite the disparate studies, the overwhelming finding is that the gap exists. Our findings confirm this.

Theoretical Analysis

Hypotheses

White

\[H_0 : White Income \propto AllIncome\]

Our null hypothesis is that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of white individuals who are 16 and older in the United States.

\[H_A : White Income \not\propto AllIncome\]

Our alternate hypothesis is that the median income of white individuals aged 16 and older in the United States is significantly higher than the median income of individuals aged 16 and older.

Black

\[H_0 : Black Income \propto AllIncome\]

Our null hypothesis is that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of black individuals who are 16 and older in the United States.

\[H_A : Black Income \not\propto AllIncome\]

Our alternate hypothesis is that the median income of black individuals aged 16 and older in the United States is significantly lower than the median income of individuals aged 16 and older.

Hispanic

\[H_0 : Hispanic Income \propto AllIncome\]

Our null hypothesis is that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of Hispanic individuals who are 16 and older in the United States.

\[H_A : Hispanic Income \not\propto AllIncome\]

Our alternate hypothesis is that the median income of Hispanic individuals aged 16 and older in the United States is significantly lower than the median income of individuals aged 16 and older.

Models

Our baseline model is simple: mod1 predicts Median from Date alone. mod2 instead models \(\log(Median)\), since monetary values tend to be closer to linear on a log scale than on a raw scale. These two models represent \(AllIncome\). modBlack adds the Black factor, interacted with Date, to distinguish the Black population from the non-Black population. modWhite and modHispanic do the same for the white and Hispanic populations, respectively.

\[mod1 : Median=\beta_1Date+\beta_0+e\]
\[mod2 : \log(Median)=\beta_1Date+\beta_0+e\]
\[modWhite : \log(Median)=\beta_1Date*White+\beta_0+e\]
\[modBlack : \log(Median)=\beta_1Date*Black+\beta_0+e\]
\[modHispanic : \log(Median)=\beta_1Date*Hispanic+\beta_0+e\]
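
The \(Date*Race\) shorthand follows R formula notation, where `Date*R` expands to both main effects plus their interaction. Written out in full, with \(R\) standing for the binary race indicator and the \(\gamma\) coefficients introduced here purely for illustration (they are not the \(\beta\) numbering used above), the race models take roughly this form:

\[\log(Median)=\gamma_0+\gamma_1 Date+\gamma_2 R+\gamma_3 (Date\times R)+e\]

Under this reading, \(\gamma_2\) shifts the intercept for the group and \(\gamma_3\) shifts its growth rate over time. Comparing this model against mod2 amounts to asking whether \(\gamma_2=\gamma_3=0\) jointly.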

Empirical Analysis

Data

About the Data

We have two sources of data: the U.S. Bureau of Labor Statistics (BLS) and the Economic Policy Institute (EPI), which supplies the majority of the data.

BLS maintains a data set called cpsaat, which summarizes wage earnings by type of job, race, and gender. To access the data in R, we use curl_download to retrieve the .xlsx file from the internet and readxl::read_excel to read it.

EPI hosts a lot of data on wage statistics, including the minimum wage and the participation and earnings of each race, gender, education level, and much more. Due to the way EPI presents the data, it cannot be downloaded with curl. Instead, we accessed the data with the package epidata, a simple package that interfaces with EPI so that the data do not have to be downloaded manually. EPI does not contain individual observations for wage; instead it provides two summaries of the data grouped by race, age, gender, and education. The first is the median: 50% of people make more and 50% of people make less than this value. The other is the mean, which EPI calls the average: the sum of all wages divided by the number of observations. \[\bar x=\frac{\sum_{i=0}^{n-1} x_i}{n}\]

To reduce the effect of the highest earners we use the median, as is common for housing prices. A single high outlier moves the median by at most one position in the ordered data, whereas it can move the mean substantially.
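
A small illustration of why the median is less sensitive to top earners than the mean. The wage values below are hypothetical and are not taken from the EPI or BLS data:

```r
# Hypothetical hourly wages (illustrative values only, not project data).
wages <- c(12, 15, 18, 20, 25)

# The same wages plus one extreme high earner.
wages_outlier <- c(wages, 500)

# Without the outlier, the mean and the median agree.
mean_base   <- mean(wages)
median_base <- median(wages)

# With the outlier, the mean jumps while the median barely moves.
mean_out   <- mean(wages_outlier)
median_out <- median(wages_outlier)

cat("mean:  ", mean_base, "->", round(mean_out, 2), "\n")
cat("median:", median_base, "->", median_out, "\n")
```

Here a single added observation pulls the mean from 18 to roughly 98, while the median only moves from 18 to 19.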

Clean Data

As with most data, ours has to be cleaned. This includes pivoting each tibble into a longer tibble, which works better with ggplot2. The original format is called wide format because it has many columns. We convert it to long format, which has many rows, with pivot_longer. Sometimes the new column this creates contains more than one value. To remedy this, we use separate, and mutate if necessary, to get the values into the right columns. Another inconsistency is that the currency values are expressed in dollars of different years. The difference is not large, but it should be corrected.

Minimum_wage has data in terms of 2018 USD, while the other data are in 2019 USD. Because 2019 is the latest year and the easiest to work with, we use 2019 USD throughout. Although the difference is small, we need to adjust for inflation. The package priceR allows us to convert monetary values from one year into another using online inflation data.

Because the data were imported with epidata, the column names differ from those in the CSV files, so we rename them for consistency. For this project the names are capitalized.
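
The wide-to-long step can be sketched as follows. The toy table and its column names (`white_median`, `black_median`, `key`, `Statistic`, `Wage`) are assumptions made up for illustration; the real EPI columns are named differently:

```r
library(tidyr)

# Hypothetical wide table: one column per race/statistic combination.
wide <- data.frame(
  Date = c(2018, 2019),
  white_median = c(20.1, 20.5),
  black_median = c(15.0, 15.3)
)

# Step 1: pivot to long format; the old column names land in `key`.
long_raw <- pivot_longer(
  wide,
  cols = -Date,
  names_to = "key",
  values_to = "Wage"
)

# Step 2: `key` holds two values ("white_median"), so split it in two.
long <- separate(long_raw, key, into = c("Race", "Statistic"), sep = "_")

print(long)
```

The result has one row per Date and Race, which is the shape ggplot2 expects when mapping Race to a color or facet.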
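
As a rough sketch of what such an adjustment does, assume a price index \(P_t\) for year \(t\). The symbol is introduced here only for illustration, and priceR's internal calculation may differ in detail. A value in 2018 dollars is then rescaled to 2019 dollars by the ratio of the two index levels:

\[Value_{2019}=Value_{2018}\times\frac{P_{2019}}{P_{2018}}\]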

if(!dir.exists("../data"))dir.create("../data")

cpsaat11%>%
    write_csv("../data/cpsaat11.csv")

Minimum_wage%>%
    write_csv("../data/Minimum_wage.csv")

Participation%>%
    write_csv("../data/Participation.csv")

Wages%>%
    write_csv("../data/Wages.csv")

Methodology

Using the Data

After acquiring the data, the next step was cleaning it. Once the data were cleaned and reorganized, we filtered them for each hypothesis. With the data separated by race and age group, we plotted graphs showing how the different races compare to each other on a median income basis. The graphs alone, however, do not test our hypotheses rigorously enough.

The graphs are followed by Chow tests. After performing the first Chow test we can reject our null hypothesis that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of white individuals who are 16 and older in the United States. This is due to the p-value being less than 0.01.

The second Chow test also results in a rejection of our null hypothesis, which states that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of black individuals who are 16 and older in the United States. Since the p-value is less than 0.01, we can conclude that the median income of black individuals aged 16 and older in the United States is significantly lower than the median income of individuals aged 16 and older.

The final Chow test again results in a rejection of the null hypothesis, which states that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of Hispanic individuals who are 16 and older in the United States. We can then conclude that the median income of Hispanic individuals aged 16 and older in the United States is significantly lower than the median income of individuals aged 16 and older. Taken together, the Chow tests support our hypotheses: white individuals earn a higher median income than United States residents overall, while black and Hispanic individuals earn a lower median income.

Open Source

This document is open source, meaning anyone can view the code, comment on it, and suggest modifications. The RMarkdown used to create both the HTML and PDF documents is available at https://github.com/zekrom-vale/IncomeOnRace, along with the full modification history in the repository. Both versions are created from the same multi-file RMarkdown, using knitr to knit statically to PDF or interactively to HTML. The most up-to-date PDF and HTML versions are available there. The GitHub repository also archives the data we used, after cleaning, in CSV format.

HTML Version

An interactive HTML document / web site is available at https://zekrom-vale.github.io/IncomeOnRace/ and is recommended over the static PDF version. The HTML version uses the package plotly, which allows users to zoom, filter, play animations, and inspect data in graphs. The static PDF version uses ggplot2, which does not support interactivity.

Results

Estimators

WagesAll=Wages%>%
    filter(is.na(Race),is.na(Gender))%>%
    # Convert Year into date
    mutate(Date=lubridate::as_date(glue("{Date}-1-1")))%>%
    select(Date, Median)
# WageTs=ts(WagesAll, start = min(WagesAll$Date), end = max(WagesAll$Date), frequency = 1)
# acf(WageTs[, "Median"])
WagesAll%>%
    plot_acf_diagnostics(
        .value=Median,
        .date_var = Date,
        .interactive = !(isKnit()&&knitr::is_latex_output()),
        # Use years as the lag interval so it's not confusing.
        .lags=glue("{max(Wages$Date)-min(Wages$Date)} years")
)

Figure: autocorrelation diagnostics for Median (lags in years).

There is a lot of autocorrelation, so we update the models to include lags of the dependent variable, \(\beta Median_{t-1}\).

These models now become:
\[mod0 : Median_t=\beta_3Date+\beta_2Median_{t-2}+\beta_1Median_{t-1}+\beta_0+e\]
\[mod1 : Median_t=\beta_3Date+\beta_2Median_{t-2}+\beta_1Median_{t-1}+\beta_0+e\]
\[mod2 : \log(Median_t)=\beta_3Date+\beta_2Median_{t-2}+\beta_1Median_{t-1}+\beta_0+e\]
\[modWhite : \log(Median_t)=\beta_3Date*White+\beta_2Median_{t-2}+\beta_1Median_{t-1}+\beta_0+e\]
\[modBlack : \log(Median_t)=\beta_3Date*Black+\beta_2Median_{t-2}+\beta_1Median_{t-1}+\beta_0+e\]
\[modHispanic : \log(Median_t)=\beta_3Date*Hispanic+\beta_2Median_{t-2}+\beta_1Median_{t-1}+\beta_0+e\]
where \(White\), \(Black\), and \(Hispanic\) are binary features based on \(Race\).
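
A minimal sketch of fitting a model with two lags of the dependent variable, using a hypothetical simulated series rather than the project data. One caveat worth checking in any lagged regression: `stats::lag()` applied to a plain numeric column only changes time-series attributes and does not shift the values, so this sketch builds the lag columns explicitly with `dplyr::lag()`. The column names `MedianLag1` and `MedianLag2` are made up for illustration:

```r
library(dplyr)

# Hypothetical yearly series standing in for the all-population median wage.
set.seed(42)
toy <- data.frame(
  Date = 2000:2019,
  Median = 15 + cumsum(rnorm(20, mean = 0.1, sd = 0.2))
)

# Build the lagged regressors explicitly; the first rows become NA.
toy <- toy %>%
  mutate(
    MedianLag1 = dplyr::lag(Median, 1),
    MedianLag2 = dplyr::lag(Median, 2)
  )

# lm() drops rows with missing lags under the default na.action (na.omit).
fit <- lm(Median ~ Date + MedianLag1 + MedianLag2, data = toy)

# Two observations are lost to the two lags.
n_used <- nobs(fit)
print(n_used)
```

With real data that mixes several races and genders in one table, the lags would need to be computed within each group (for example after `group_by()`), so that one group's values do not leak into another's.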

mod0=dynlm(Median~Date+stats::lag(Median, -1), data = Wages)

bgtest(mod0, order = 1, type = "F", fill = NA)

    Breusch-Godfrey test for serial correlation of order up to 1

data:  mod0
LM test = 2.4598, df1 = 1, df2 = 559, p-value = 0.1174

WagesAll%>%
    ggplot(aes(x=Median, y=stats::lag(Median, k=-2)))+
    geom_point()

mod1=lm(Median~Date+stats::lag(Median, -1), data = Wages)
mod2=lm(log(Median)~Date+stats::lag(Median, -1), data = Wages)
WagesAll%>%
    ggplot(aes(x=Date, y=Median))+
    geom_point()

chow=function(racestr){
    WagesRace=Wages%>%
        mutate(R=if_else(Race==racestr, 1, 0))%>%
        filter(!is.na(Race),is.na(Gender))
    
    mod2=lm(log(Median)~Date,
                    data=WagesRace
    )
    
    
    modRace=lm(log(Median)~Date*R,
                    data=WagesRace
    )
    
    stargazer(mod2, modRace,
        header=FALSE,
        type=knittype,
      title="Model comparison, 'wage' equation",
      keep.stat="n",digits=2, single.row=TRUE,
      intercept.bottom=FALSE
    )
    
    anova(mod2, modRace)%>%
        kable()
}
chow("white")

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 139 | 2.5887721 | NA | NA | NA | NA |
| 137 | 0.2989179 | 2 | 2.289854 | 524.7428 | 0 |

After performing a Chow test we can reject our null hypothesis that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of white individuals who are 16 and older in the United States, since our p-value is less than 0.01. We conclude that the median income of white individuals aged 16 and older in the United States is significantly higher than the median income of individuals aged 16 and older.
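
The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares. This is a sketch of the standard nested-model F test, not code from the project; the helper name `f_stat` is hypothetical, and the inputs are the rounded values printed in the white and black tables:

```r
# F statistic for comparing a restricted model against an unrestricted one.
# q is the number of extra parameters; df_u is the unrestricted residual df.
f_stat <- function(rss_restricted, rss_unrestricted, q, df_u) {
  ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_u)
}

# Values copied from the ANOVA output for chow("white").
f_white <- f_stat(2.5887721, 0.2989179, q = 2, df_u = 137)

# Values copied from the ANOVA output for chow("black").
f_black <- f_stat(2.588772, 2.361682, q = 2, df_u = 137)

# Upper-tail p-values from the F distribution with (2, 137) df.
p_white <- pf(f_white, 2, 137, lower.tail = FALSE)
p_black <- pf(f_black, 2, 137, lower.tail = FALSE)

print(round(c(white = f_white, black = f_black), 2))
```

Up to rounding of the printed inputs, these match the F column of the tables. The two numerator degrees of freedom correspond to the race indicator and its interaction with Date.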

chow("black")

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 139 | 2.588772 | NA | NA | NA | NA |
| 137 | 2.361682 | 2 | 0.2270902 | 6.586694 | 0.0018567 |

After performing another Chow test we can reject our null hypothesis that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of black individuals who are 16 and older in the United States, since our p-value is less than 0.01. We conclude that the median income of black individuals aged 16 and older in the United States is significantly lower than the median income of individuals aged 16 and older.

chow("hispanic")

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 139 | 2.588772 | NA | NA | NA | NA |
| 137 | 1.498359 | 2 | 1.090413 | 49.85007 | 0 |

After performing our final Chow test, we can reject our null hypothesis that there is no significant difference between the median income of individuals aged 16 and older in the United States and the median income of Hispanic individuals who are 16 and older in the United States, since our p-value is less than 0.01. We conclude that the median income of Hispanic individuals aged 16 and older in the United States is significantly lower than the median income of individuals aged 16 and older.

One limitation of this study is data collection: we cannot collect the income of everyone in the United States. However, the data we do have give a good representation of incomes in the United States as currently known. Another major issue is that the data are reported voluntarily. People may decline to participate because of their current financial status, which would skew the data and could change the outcome.

Conclusion

References

Williams, R.M., 1987. Capital, competition, and discrimination: A reconsideration of racial earnings inequality. Review of Radical Political Economics 19, 1–15.